Firmament: Fast, Centralized Cluster Scheduling at Scale
Abstract
Centralized datacenter schedulers can make high-quality placement decisions when scheduling tasks in a cluster. Today, however, high-quality placements come at the cost of high latency at scale, which degrades response time for interactive tasks and reduces cluster utilization. This paper describes Firmament, a centralized scheduler that scales to over ten thousand machines at subsecond placement latency even though it continuously reschedules all tasks via a min-cost max-flow (MCMF) optimization. Firmament achieves low latency by using multiple MCMF algorithms, by solving the problem incrementally, and via problem-specific optimizations. Experiments with a Google workload trace from a 12,500-machine cluster show that Firmament improves placement latency by 20× over Quincy [22], a prior centralized scheduler using the same MCMF optimization. Moreover, even though Firmament is centralized, it matches the placement latency of distributed schedulers for workloads of short tasks. Finally, Firmament exceeds the placement quality of four widely-used centralized and distributed schedulers on a real-world cluster, and hence improves batch task response time by 6×.
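To make the min-cost max-flow formulation mentioned in the abstract concrete, the following is a minimal sketch of task placement as a flow problem. It is not Firmament's implementation: the solver is a textbook successive-shortest-path algorithm chosen for brevity, and the names `MinCostFlow`, `place_tasks`, `costs`, `slots`, and `unscheduled_cost`, as well as the single "unscheduled" node, are assumptions made for illustration. Each task sends one unit of flow to the sink, either through a machine with a free slot or through the unscheduled node at a penalty.

```python
from collections import deque

INF = float("inf")


class MinCostFlow:
    """Textbook successive-shortest-path solver (illustrative, not Firmament's)."""

    def __init__(self, n):
        self.n = n
        # Each edge is [to, residual_capacity, cost, index_of_reverse_edge].
        self.graph = [[] for _ in range(n)]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t):
        flow = cost = 0
        while True:
            # Queue-based Bellman-Ford: cheapest augmenting path in the residual graph.
            dist = [INF] * self.n
            prev = [None] * self.n
            queued = [False] * self.n
            dist[s] = 0
            queue = deque([s])
            while queue:
                u = queue.popleft()
                queued[u] = False
                for i, (v, cap, c, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v] = dist[u] + c
                        prev[v] = (u, i)
                        if not queued[v]:
                            queued[v] = True
                            queue.append(v)
            if prev[t] is None:
                return flow, cost  # no augmenting path left
            # Find the bottleneck capacity along the path, then push flow along it.
            push, v = INF, t
            while v != s:
                u, i = prev[v]
                push = min(push, self.graph[u][i][1])
                v = u
            v = t
            while v != s:
                u, i = prev[v]
                edge = self.graph[u][i]
                edge[1] -= push
                self.graph[v][edge[3]][1] += push
                v = u
            flow += push
            cost += push * dist[t]


def place_tasks(costs, slots, unscheduled_cost):
    """costs[t][m]: hypothetical cost of running task t on machine m;
    slots[m]: free task slots on machine m."""
    num_tasks, num_machines = len(costs), len(slots)
    source, first_machine = 0, 1 + num_tasks
    unsched = first_machine + num_machines
    sink = unsched + 1
    net = MinCostFlow(sink + 1)
    for t in range(num_tasks):
        net.add_edge(source, 1 + t, 1, 0)
        net.add_edge(1 + t, unsched, 1, unscheduled_cost)
        for m in range(num_machines):
            net.add_edge(1 + t, first_machine + m, 1, costs[t][m])
    for m in range(num_machines):
        net.add_edge(first_machine + m, sink, slots[m], 0)
    net.add_edge(unsched, sink, num_tasks, 0)
    _, total = net.solve(source, sink)
    # A saturated task-to-machine edge means the task was placed on that machine.
    placement = {}
    for t in range(num_tasks):
        placement[t] = None
        for v, cap, _, _ in net.graph[1 + t]:
            if first_machine <= v < unsched and cap == 0:
                placement[t] = v - first_machine
    return placement, total


if __name__ == "__main__":
    # Three tasks, two machines with one slot each: one task must stay unscheduled.
    print(place_tasks([[1, 5], [2, 1], [3, 4]], [1, 1], unscheduled_cost=10))
    # prints ({0: 0, 1: 1, 2: None}, 12)
```

Because the whole placement is a single optimization, changing one cost can move several tasks at once. The sketch re-solves from scratch on every call and is only meant to show the shape of the flow network.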
Similar resources
Dynamic Load Balancing Through Coordinated Scheduling in Packet Data Systems
Third generation code-division multiple access (CDMA) systems propose to provide packet data service through a high speed shared channel with intelligent and fast scheduling at the base-stations. In the current approach base-stations schedule independently of other base-stations. We consider scheduling schemes in which scheduling decisions are made jointly for a cluster of cells thereby enhanci...
Partitioned Parallel Job Scheduling for Extreme Scale Computing
Recent success in building petascale computing systems poses new challenges in job scheduling design to support cluster sizes that can execute up to two million concurrent tasks. We show that for these extreme scale clusters the resource demand at a centralized scheduler can exceed the capacity or limit the ability of the scheduler to perform well. This paper introduces partitioned scheduling, ...
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Datacenter-scale computing for analytics workloads is increasingly common. High operational costs force heterogeneous applications to share cluster resources for achieving economy of scale. Scheduling such large and diverse workloads is inherently hard, and existing approaches tackle this in two alternative ways: 1) centralized solutions offer strict, secure enforcement of scheduling invariants...
Failure Prediction and Scalable Checkpointing for Reliable Large-Scale Grid Computing
Computational clusters, the grids that federate them, and the applications that utilize their significant computing potential, all continue to grow with advances in hardware technology, cluster management, and grid middleware solutions. As they do, the likelihood that large-scale long-running grid and cluster applications will have to deal with underlying node unavailability and cluster failure...
Centrally Controlled Clustered Wireless Sensor Networks
We present IMPERIA, a centrally managed architecture for large-scale wireless sensor networks (WSN). Within the WSN, sensor nodes communicate using a clustered multihop TDMA protocol, which globally synchronizes the network and collects data at ultra-low power consumption. The novel contributions to the state-of-the-art include a) an efficient algorithm for network topology discovery and link q...
Publication date: 2016